Categorization of tweets for damages: infrastructure and human damage assessment using fine-tuned BERT model

Identification of infrastructure and human damage assessment tweets is beneficial to disaster management organizations as well as victims during a disaster. Most prior works focused on the detection of informative/situational tweets and infrastructure damage; only one focused on human damage. This study presents a novel approach for detecting damage assessment tweets involving infrastructure and human damages. We investigated the potential of the Bidirectional Encoder Representations from Transformers (BERT) model to learn universal contextualized representations, aiming to demonstrate its effectiveness for binary and multi-class classification of disaster damage assessment tweets. The objective is to exploit a pre-trained BERT as a transfer learning mechanism after fine-tuning important hyper-parameters on the CrisisMMD dataset containing seven disasters. The effectiveness of fine-tuned BERT is compared with five benchmarks and nine comparable models by conducting exhaustive experiments. The findings show that the fine-tuned BERT outperformed all benchmarks and comparable models and achieved state-of-the-art performance, demonstrating up to 95.12% macro-f1-score for binary classification and 88% macro-f1-score for multi-class classification. Specifically, the improvement in the classification of human damage is promising.


INTRODUCTION
Disasters may cause massive destruction and sometimes create uncontrollable and unpredictable situations. Natural disasters are caused by natural phenomena like wildfires, floods, etc., and their intensities affect the proportion of lives, the environment, and the economy of an area (Koshy & Elango, 2023). During disasters, public and private organizations rely on critical and timely information to set up the operations required for helping affected people. The development of web technologies enables people to use social media platforms such as Twitter to exchange their views and report recent happenings in their surroundings. Twitter is one of the most popular and widely used platforms and allows tweets of up to 280 characters. More than 486 million users are active on Twitter according to recent statistics and more than 1.4 trillion tweets are posted annually (Madichetty, Muthukumarasamy & Jayadev, 2021).
For humanitarian authorities, the assessment of disaster damage is one of the critical steps for understanding the real situation and the seriousness of the damage so that services can be provided accordingly. It is common practice during a disaster and in its aftermath for people to post a massive number of messages on Twitter related to situational information (Madichetty, 2020; Madichetty & Sridevi, 2020; Rudra et al., 2018). Therefore, the identification of damage assessment in social media posts such as tweets is an important task. Several studies (Cresci et al., 2015; Nguyen et al., 2017; Priya et al., 2018; Rudra et al., 2018) addressed this task of damage assessment. Cresci et al. (2015) focused on Italian tweets, handling damage assessment of buildings and infrastructure only, and did not address human damage assessment. Likewise, Rudra et al. (2018) proposed a model for situational tweet identification and summarization for English and Hindi tweets but did not address human damage assessment. Nguyen et al. (2017) also missed human damage assessment and important information from textual data, and focused only on buildings and infrastructure damage using image data. Priya et al. (2018) developed a query-based information retrieval method for infrastructural damage assessment but did not address human damage. Moreover, a few studies proposed approaches for informational vs non-informational tweet identification, such as a majority voting approach (Krishna, Srinivas & Prasad Reddy, 2022) and a multi-modal approach using image and text data (Koshy & Elango, 2023), but did not address human damage assessment. To the best of our knowledge, only one study, Madichetty & Sridevi (2021), focused on damage assessment for infrastructural and human damages from tweets, using a lexicon and frequency-based (hand-crafted) approach with traditional machine learning (ML) models. However, the related studies did not utilize language models for infrastructure and human damage assessment from tweets; instead, they mainly used hand-crafted features. Moreover, the performance achieved by the baseline (Madichetty & Sridevi, 2021) is not promising. To fill this gap, two objectives are devised in this study:
1. To design an automated damage assessment approach to identify infrastructural and human damages from tweet data using a state-of-the-art language model.
2. To demonstrate the effectiveness of the proposed automated approach in comparison with benchmarks through experimental results.
To capture the actual context of the language used to describe human and infrastructural damages in tweets, a state-of-the-art language model is required. Fine-tuning the BERT model has demonstrated robust performance in similar natural language processing (NLP) tasks (Malik, Cheema & Ignatov, 2023; Malik, Imran & Mamdouh, 2023). Therefore, we are interested in utilizing the architecture of the BERT language model with fine-tuning. The novelty of the study is threefold: First, the proposed framework is based on an automated feature generation model in contrast to hand-crafted features. Second, the fine-tuning of the BERT transformer model is performed not only for damage tweet identification but also for human and infrastructure damage detection. Third, the proposed framework delivered benchmark performance and outperformed the standard baselines and comparable models. The proposed study contributes in the following ways:
1. To the best of our knowledge, the fine-tuning of BERT is used for the first time for the identification of infrastructure and human damage assessment tweets.
2. The utilization of contextual semantic embeddings helped us to handle the ambiguity and complexity issues of the seven disasters of the CrisisMMD dataset.
3. The fine-tuning of BERT for binary and multi-class classification on seven disasters showcases substantial improvement in performance as compared to five benchmarks and nine comparable models.
4. The optimization of hyperparameters is performed for the BERT model to handle the overfitting and catastrophic forgetting issues and to obtain benchmark performance.
5. The improvement achieved by the proposed framework is shown to be statistically significant, as verified by the Wilcoxon signed-rank test.
6. An extensive set of experiments demonstrates that fine-tuned BERT achieved up to 95.12% macro f1-score for binary classification and 88% macro f1-score for multi-class classification.
The remaining part of the article is organized as follows: related work is described in 'Related Work', followed by 'Framework Methodology', in which the proposed methodology is described with the details of fine-tuning BERT. 'Experimental Results and Analysis' presents the dataset description and experimental setup, and discusses the results in detail. 'Conclusion' concludes the research work and presents future directions.

RELATED WORK
In this section, we review the literature that addresses the issue of assessment of social media posts for various damages and disaster detection and summarization approaches.
In 2014, an automated classification approach for informative tweets was designed (Imran et al., 2014). The authors named their model ''Artificial Intelligence for Disaster Response (AIDR)'', tested it on the Pakistani Earthquake dataset, and achieved 90% area under curve (AUC). Later, Cresci et al. (2015) proposed an infrastructure damage assessment detection model for Italian tweets. Their method used SVM with a variety of linguistic features, and they claimed that their approach was the first to be tested on non-English data. Then, Nguyen et al. (2017) utilized the VGG-16 vision model and bag of visual words to build a multi-class classification framework. Their model was tested on several disaster datasets and VGG-16 outperformed the bag of visual words approach. Likewise, Rudra et al. (2018) proposed a two-step methodology to extract situational information and then summarize disaster tweets in English and Hindi. They explored low-level lexical and syntactic features with an SVM classifier and claimed that non-English tweets were explored for the first time.
In 2019, the Domain-Adversarial Neural Network (DANN) model was used with VGG-19 to identify damages from image data (Li et al., 2019). The authors claimed that their approach demonstrated significant performance, but they did not address human damage identification. Then, several ML and deep learning (DL) models were explored with the Term Frequency-Inverse Document Frequency (TF-IDF) model to classify informative microblog posts into multiple categories (Kumar, Singh & Saumya, 2019). The authors compared the performances of the ML and DL models and reported the best results. Another image-based study was conducted by Imran et al. (2020). The proposed approach classified tweets into damage or non-damage and then categorized them by severity (severe vs mild vs non-damage), but it did not address human damage identification. Later, an information retrieval approach was utilized to assess infrastructure damage tweets (Priya et al., 2020). The authors developed a topic-aligned query expansion method and evaluated it on several disaster datasets. Similarly, Alam, Ofli & Imran (2020) analyzed the situational characteristics of three hurricane disasters and developed a multi-class classification model using random forest. Their findings revealed that both text and image data contain important information.
In 2021, Alam et al. (2021) attempted to combine various crisis datasets to facilitate binary and multi-class classification. The authors used the convolutional neural network (CNN) model with FastText embeddings to explore the impact of these approaches and drew conclusions. As noted earlier, there is only one study, Madichetty & Sridevi (2021), that addressed infrastructure and human damage collectively from tweets. We chose this study as one of the baselines. The authors used lexicons and TF-IDF features with six ML models and reported that their framework outperformed the baselines. Later, a majority voting-based approach was presented in Krishna, Srinivas & Prasad Reddy (2022) to identify only informative tweets using word2vec, TF-IDF, and the GloVe model. Their model showed significant performance, but they did not handle infrastructure and human damages. A real-time damage assessment of tweets using image data was performed by Imran et al. (2022). The authors developed a system using computer vision models to determine the severity of damages. Later, Koshy & Elango (2023) derived an approach for informative tweet identification using the Robustly Optimized BERT (RoBERTa) model with bidirectional long short-term memory (Bi-LSTM) on textual and image data. Their results demonstrated the importance of their binary model.
Recently, another DL-based approach was presented by Paul, Sahoo & Balabantaray (2023) to classify disaster tweets into binary and multi-class categories. The authors used CNN, GRU, and SkipCNN models, and their model showed significant improvement. Then, Alam et al. (2023) proposed a multi-task learning framework using image data. They released a dataset and conducted binary and multi-class classification tasks using DL models. Their model showed significant performance. Likewise, Lv, Wang & Shao (2023) built an auto-encoder-based model for classifying crisis-related tweets. Textual and image data were used to test the model, which outperformed the benchmark. Then, a disaster summarization method was proposed by Garg, Chakraborty & Dandapat (2023) using an ontology technique and tested on twelve disaster datasets. Their model outperformed the baselines. A recent multi-class classification approach for disaster tweets was presented by Asinthara, Jayan & Jacob (2023), who used TF-IDF and word2vec features with SVM and Bi-LSTM models. Their results show that the performance is significant with the SVM model. Likewise, the identification of high-priority tweets was performed by Arathi & Sasikala (2023). Their model used GloVe embeddings and metadata features with the random forest model and achieved 91% accuracy and 94% f1-score.
More recently, Dasari, Gorla & Prasad Reddy (2023) built a classification system for the detection of informative tweets. A stacking ensemble model was proposed and used with TF-IDF, word2vec, and GloVe feature models. Their model demonstrated better performance than the baselines. For informative tweet categorization, the latest approach used an ontology-infused DL model (Giri & Deepak, 2023). The authors tested their approach on an image and textual dataset and claimed that their model presented benchmark performance. Likewise, Madichetty & Madisetty (2023) developed a detection pipeline for multi-modal disaster tweets. They utilized RoBERTa and VGG-16 models for feature extraction and combined their outputs using a fusion method. Their model outperformed the baselines. Another study handled the classification of disaster tweets by exploring bag-of-words and several ML models (Iparraguirre-Villanueva et al., 2023). The highest performance achieved was 87% accuracy.
The summary of prior approaches related to damage assessment is presented in Table 1. In contrast, some recent approaches focused on the identification of emergency messages relevant to first responders. For example, Powers et al. (2023) proposed a framework to identify emergency tweets and then categorize them according to relevancy and urgency level. The authors used BERT and XLNet transformers with the CNN model, and their model showed promising performance. We found the following limitations in the literature regarding damage assessment of disaster tweets:
• Lack of human damage assessment: To the best of our knowledge, only one study has focused on the assessment of human damage as well as infrastructure damage.
• Effective feature engineering: Most of the studies used linguistic, syntactic, and frequency-based (hand-crafted) features, but did not use language models or their fine-tuning.

FRAMEWORK METHODOLOGY
In this section, the details of the proposed framework are described. At the first level, the framework detects whether a tweet is a damage or non-damage tweet. At the second level, it further classifies the damage tweets into infrastructure damage or human damage.
The pipeline of the proposed framework is presented in Fig. 1. The CrisisMMD dataset is preprocessed by applying several steps (Section 'Data Pre-processing'), and then the dataset is split in an 80-20 ratio (80% training and 20% testing). After that, the dataset is transformed into a specified format so that it can be used as input to the transformer model. Then the fine-tuning of BERT is performed using the grid search technique for hyper-parameter optimization. The results of the comparable models and state-of-the-art baselines are generated and compared with the fine-tuned BERT model. In the end, the conclusions are drawn. The pseudo-code of the proposed methodology is presented in Fig. 2.

Data pre-processing
The following pre-processing steps are employed before providing data to the fine-tuning process and before the extraction of TF-IDF and word2vec features.
1. Removal of hashtags, HTML tags, mentions, punctuation, URLs, and numbers.
2. Conversion of tweets to lowercase.
3. Replacement of emojis/emoticons with their corresponding text.
4. Correction of misspelled words.
5. Decoding of abbreviations (thnx, thx, btw, pls, plz, etc.).
6. Removal of stop words (only for TF-IDF and word2vec).
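The cleaning steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the abbreviation table and the stop-word list are small hypothetical stand-ins, and emoji replacement and spelling correction (steps 3 and 4) are omitted because they would need external resources that the paper does not name.

```python
import re

# Hypothetical lookup tables; the paper does not list its full mappings.
ABBREVIATIONS = {"thnx": "thanks", "thx": "thanks", "btw": "by the way",
                 "pls": "please", "plz": "please"}
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and"}

def clean_tweet(text, remove_stop_words=False):
    """Apply steps 1, 2, 5 and (optionally) 6 of the pre-processing list."""
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"[@#]\w+", " ", text)                # mentions and hashtags
    text = text.lower()                                 # lowercase
    text = re.sub(r"[^a-z\s]", " ", text)               # punctuation and numbers
    # Expand abbreviations, then re-split so multi-word expansions become tokens.
    tokens = " ".join(ABBREVIATIONS.get(t, t) for t in text.split()).split()
    if remove_stop_words:                               # only for TF-IDF / word2vec
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(tokens)
```

Stop-word removal is left as an optional flag because the paper applies it only to the TF-IDF and word2vec pipelines, not to the BERT input.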

Fine-tuning BERT
The BERT model was introduced by Devlin et al. (2018) at Google and has proven its significance for a variety of text-mining tasks in several application domains (Malik, Imran & Mamdouh, 2023). The benefits of BERT include faster development, automated feature generation, reduced data requirements, and improved performance. It has two architectures, and we are interested in fine-tuning the pre-trained BERT model for the damage assessment tweet identification task, for binary as well as multi-class classification. The BERT model is pre-trained on a large corpus of English data in a self-supervised fashion and uses the context of language in both directions. Furthermore, BERT was pre-trained on next-sentence prediction and masked language modeling objectives.
To fine-tune the BERT base uncased model (https://huggingface.co/bert-base-uncased), some important steps are required. After applying the above-mentioned pre-processing steps, the data transformation and classifier training steps are executed. We have chosen sequence lengths of 64 and 128 because a tweet may contain at most 280 characters, and a sequence length of 128 is enough to handle the longest tweets. Therefore, all tweets are padded to a length of 64 or 128, depending on the configuration. After that, attention masks are added to distinguish real tokens from padded tokens. The token ids and attention masks are then fed to the BERT model and the fine-tuning step is performed.
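To make the padding and attention-mask step concrete, the following minimal sketch operates on a list of token ids that is assumed to have already been produced by a BERT tokenizer. The function name `pad_and_mask` is hypothetical, and the default padding id of 0 is an assumption matching the [PAD] token of the bert-base-uncased vocabulary.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Truncate or pad token ids to max_len and build the attention mask.

    The mask holds 1 for real tokens and 0 for padding, so the model can
    ignore padded positions during self-attention.
    """
    ids = list(token_ids[:max_len])        # simple truncation (illustrative)
    mask = [1] * len(ids)
    n_pad = max_len - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad
```

In practice a tokenizer library would normally perform this step (and keep the closing [SEP] token when truncating); the sketch only shows the relationship between the padded ids and the mask.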
BERT classifier training: There are seven disasters in the CrisisMMD dataset. For training and validation of the BERT classifiers, we split each disaster in an 80-20 ratio using the stratified sampling approach. After that, we took the 80% portion from each disaster and combined them to make the training dataset. The remaining 20% of data from each disaster is used for testing the BERT classifier on that specific disaster. Furthermore, the combined 80% data is further divided into a 90-10 split, in which 90% is used for training and 10% is used for validation. We utilize the BERT base model, which contains 12 transformer layers, 12 attention heads, and a hidden size of 768. All entities (class labels, token ids, and attention masks) are combined into one set.
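The splitting protocol described above (a stratified 80-20 split per disaster, pooling of the training parts, then a 90-10 train/validation split) can be sketched as follows. This is a plain-Python illustration under stated assumptions: the helper names are hypothetical, the second split is also stratified here (the paper does not say whether it is), and the authors' actual tooling is not specified.

```python
import random
from collections import defaultdict

def stratified_split(items, labels, held_out_fraction, seed=0):
    """Split (item, label) pairs so each class keeps the same proportion."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for item, label in zip(items, labels):
        by_label[label].append((item, label))
    kept, held_out = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_held = round(len(group) * held_out_fraction)
        held_out.extend(group[:n_held])
        kept.extend(group[n_held:])
    return kept, held_out

def build_splits(disasters, seed=0):
    """disasters maps a disaster name to a (tweets, labels) pair."""
    pooled, test_sets = [], {}
    for name, (tweets, labels) in disasters.items():
        train_part, test_part = stratified_split(tweets, labels, 0.2, seed)
        pooled.extend(train_part)          # 80% of every disaster is pooled
        test_sets[name] = test_part        # 20% is kept per disaster for testing
    tweets = [t for t, _ in pooled]
    labels = [l for _, l in pooled]
    train, val = stratified_split(tweets, labels, 0.1, seed)  # 90-10 split
    return train, val, test_sets
```

Keeping one test set per disaster mirrors the paper's evaluation, where a single classifier trained on the pooled data is tested on each disaster separately.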
For the classification of damage tweets, we attach the outputs of BERT (after fine-tuning) to an additional layer consisting of a Softmax classifier, as shown in Fig. 3. We denote Ti as the final hidden vector for the ith token and h as the final hidden vector of the [CLS] token. The details of the hyperparameters are presented in Table 2.
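For reference, the Softmax layer on top of the [CLS] vector is usually written as below. This is the standard formulation for BERT-based text classification and is given here as an assumption, since the paper's own equation is not reproduced in the text; W denotes a task-specific weight matrix, K the number of classes (2 for binary, 3 for multi-class), and H = 768 the hidden size of BERT base.

```latex
p(c \mid h) = \mathrm{softmax}(W h), \qquad W \in \mathbb{R}^{K \times H}
```

During fine-tuning, W is learned jointly with the BERT parameters, typically by minimizing the cross-entropy between p(c | h) and the true label.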

Catastrophic forgetting: The literature demonstrates that while fine-tuning a language model to learn new knowledge, previously learned knowledge may be lost because the weights are unfrozen. Sun et al. (2019) call this catastrophic forgetting in transfer learning, and every transformer model is prone to this effect. A range of learning rates is explored to examine the effect of the learning rate on catastrophic forgetting while fine-tuning BERT. The learning rates are 1e−4, 1e−5, 2e−5, 3e−4, 3e−5, and 5e−5. In the training process, all layers of BERT are unlocked so that weights can be updated in all layers during the fine-tuning cycle. Repeating the training process several times with careful monitoring allowed us to select our starting learning rate. We concluded that fine-tuning with higher learning rates (3e−4, 3e−5, and 5e−5) could lead to convergence failure. The best performance was observed with a learning rate of 2e−5, which lessens the risk of catastrophic forgetting in fine-tuning.
Overfitting: How should the appropriate number of epochs for fine-tuning be chosen? This is a common issue when fine-tuning transformers and deep learning models. Too many epochs result in overfitting, whereas too few may cause under-fitting. There are several methods for selecting an appropriate number of epochs; for example, one can start with a large number of epochs and stop the training process when no improvement is observed on the selected metric. In this research, we use validation loss as the measure to monitor the performance of the BERT classifier. We concluded that four epochs are an appropriate number to avoid overfitting issues.
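The general idea of stopping when the monitored metric no longer improves can be sketched as follows. This is a generic early-stopping rule on validation loss, not the authors' exact procedure; the function name and the `patience` parameter are hypothetical.

```python
def select_epoch(val_losses, patience=1):
    """Return the 1-based epoch with the lowest validation loss, scanning
    epochs in order and stopping after `patience` epochs without improvement."""
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break                      # no improvement: stop training
    return best_epoch
```

A larger `patience` tolerates temporary increases in validation loss at the cost of extra training epochs.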

Word2vec
Word2vec is an algorithm that is used to generate ''distributed word representations'' from a dataset (Ali, 2019). Given sentences as input, it generates a vector of a specific length for each word. Word2vec has demonstrated significant performance in similar NLP tasks (Ali & Malik, 2023; Hussain, Malik & Masood, 2022; Younas, Malik & Ignatov, 2023). Skip-gram and continuous bag of words (CBOW) are the two algorithms supported by the word2vec model to generate word embeddings. We are interested in using the skip-gram model to generate embedding features. The skip-gram model tries to predict relevant contextual words for an input word. Window size is another parameter used to confine the number of context words in a frame and we use a window size of 100 dimensions.
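To illustrate how skip-gram frames its prediction task, the sketch below enumerates the (input word, context word) training pairs for a tokenized tweet. It shows only the pair-generation step, not the embedding training itself; the function name and the small window used in the example are hypothetical.

```python
def skipgram_pairs(tokens, window=2):
    """List (target, context) pairs: each word is paired with every word
    at most `window` positions to its left or right."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs
```

For example, `skipgram_pairs(["bridge", "collapsed", "today"], window=1)` pairs "collapsed" with both of its neighbours, while "bridge" and "today" each get a single context word.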

TF-IDF
TF-IDF is a statistical approach to evaluate the significance of a particular word within a document relative to a larger collection of documents. This technique is commonly used in NLP and information retrieval (IR) tasks (Malik, Imran & Mamdouh, 2023). It is a weighting technique in which the weight of a word in a document is proportional to its frequency of occurrence in that document and inversely proportional to its frequency across all documents.
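The weighting described above can be written compactly as follows. This sketch uses the plain textbook form, tf(t, d) · log(N / df(t)), which is an assumption: the paper does not state which TF-IDF variant was used, and library implementations often add smoothing and normalization.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute unsmoothed TF-IDF weights for a list of tokenized documents.

    Returns one dict per document mapping each term to its weight.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

A term that occurs in every document receives a weight of zero under this variant, which reflects the intuition that such a term does not help to distinguish documents.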

EXPERIMENTAL RESULTS AND ANALYSIS
In this section, a description of the dataset and details of the experimental setup are presented. Then experiments are conducted and the results are analyzed to evaluate the effectiveness of the proposed framework.

Dataset
This study used a publicly available benchmark dataset, CrisisMMD (Alam, Ofli & Imran, 2018), to test the effectiveness of the proposed framework. The dataset consists of information about seven natural disasters such as floods, earthquakes, wildfires, and hurricanes. There are seven disaster files in the CrisisMMD dataset. Each file contains tweets related to a specific event/disaster that occurred at a particular location.
The tweet text describes human and infrastructure damages. Originally, the tweets of the dataset had several types of class labels like displaced people, affected individuals, etc. As described earlier, we address damage assessment tweet identification at two levels: binary (damage vs non-damage) and multi-class (infrastructure damage vs human damage vs non-damage) classification. The final labels of tweets are derived as follows:
• Infrastructure damage class: In this class ''infrastructure damage, utility damage, vehicle damage & restoration, and casualties'' are combined.
• Human damage class: In this class ''affected individuals, injured or dead people, missing, trapped or found people, displaced people, and evacuations'' are combined.
• Damage class: In this class ''infrastructure and human damage classes'' are combined.
• Non-damage class: Tweets that describe no damage.
After the compilation of the above-mentioned categorization on seven disaster files, the final form of the dataset is described in Table 3.
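The label derivation described above amounts to a lookup from the original annotation to a binary label and a multi-class label. In the sketch below, the label strings are taken from the wording of the class descriptions and are assumptions about the dataset's actual label spelling; treating every other label as non-damage is also an assumption.

```python
# Label strings follow the class descriptions in the text (assumed spelling).
INFRASTRUCTURE = {"infrastructure damage", "utility damage",
                  "vehicle damage & restoration", "casualties"}
HUMAN = {"affected individuals", "injured or dead people",
         "missing, trapped or found people", "displaced people",
         "evacuations"}

def derive_labels(original_label):
    """Map an original label to (binary_label, multi_class_label)."""
    if original_label in INFRASTRUCTURE:
        multi = "infrastructure damage"
    elif original_label in HUMAN:
        multi = "human damage"
    else:
        multi = "non-damage"               # assumption: all other labels
    binary = "non-damage" if multi == "non-damage" else "damage"
    return binary, multi
```

The binary damage class is simply the union of the two multi-class damage categories, so both label sets can be derived in one pass over the data.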

Experimental setup
We used the Python language to compute the results. Four evaluation metrics are chosen to evaluate the performance of the fine-tuned BERT, comparable models, and five baselines. The metrics are precision, recall, accuracy, and f1-score. In addition, the Wilcoxon signed-rank statistical test (Woolson, 2007) is used to determine whether the improvements are statistically significant or not. Six state-of-the-art classifiers are chosen and comparable models are designed to compare the performance of the fine-tuned BERT model. The classifiers are random forest (RF), logistic regression (LR), support vector machine (SVM), CNN, LSTM, and Bi-LSTM. We chose these ML and DL models because they presented significant performance in similar NLP and text mining tasks (Malik et al., 2023; Rehan, Malik & Jamjoom, 2023). The following comparable models are designed:
1. Word2vec+RF
2. Word2vec+LR
3. Word2vec+SVM
4. TF-IDF+RF
5. TF-IDF+LR
6. TF-IDF+SVM
7. TF-IDF+LSTM
8. TF-IDF+Bi-LSTM
9. TF-IDF+CNN
Furthermore, to compare the performance of the proposed framework with benchmark studies, we have chosen the following studies from the literature.
1. Rudra et al. (2018) used syntactic and low-level lexical features with an SVM model for binary and multi-class classification of damage tweets and evaluated their methodology on the CrisisMMD dataset. This is one of the baselines for comparing fine-tuned BERT performance in binary and multi-class classification tasks.
2. Madichetty & Sridevi (2021) used syntactic, low-level lexical, and top-frequency features with a weighted SVM classifier. Binary and multi-class classification frameworks are designed for the identification of damage tweets.
3. Alam, Ofli & Imran (2020) proposed a system for damage assessment tweet identification, and we chose it for comparison in binary classification. They used bag-of-words features with the RF model.
4. Kumar, Singh & Saumya (2019) explored the impact of TF-IDF with several ML and DL models for the identification of damage assessment tweets as a binary classification. The best results are chosen for the comparison.
5. Lastly, Alam et al. (2021) used CNN and FastText embeddings to design a binary classification system for damage tweet identification. We compared this study with the proposed framework for binary classification.
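Since the results are reported as accuracy and macro-averaged precision, recall, and f1-score, a small reference sketch of these metrics may help. It is a plain-Python illustration of the standard definitions, not the authors' evaluation script; the function name is hypothetical.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and f1-score.

    Macro averaging computes each metric per class and takes the
    unweighted mean, so minority classes count as much as majority ones.
    """
    classes = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }
```

Because damage classes are typically much smaller than the non-damage class, the macro f1-score can be considerably lower than accuracy, which is why both are reported.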

Fine-tuning of BERT and comparison with baselines (damage vs non-damage)
In this section, we performed fine-tuning of BERT for the binary classification task (damage vs non-damage) and then compared its performance with baselines. In the second step, BERT was again trained and validated for four epochs using a sequence length of 128 and a batch size of 32, and the results are reported in the lower part of Table 4. The training and validation loss is presented in Fig. 4 (right side). The validation accuracy of the BERT classifiers is higher than with a sequence length of 64, but the validation f1-score is lower. At epoch 2, the best accuracy is achieved, i.e., 93.33%, and the best f1-score is 86.33%. The training loss decreases steadily and converges to a value of 0.08. In contrast, validation loss increases continuously from epoch 1 to 4, indicating that further training is not useful. Hence, validation loss shows a similar (increasing) pattern for both sequence lengths, while training loss decreases continuously.
In the third step, we tested the BERT classifiers (previously trained and validated) on the test parts of the seven disasters. For each configuration (sequence lengths of 64 and 128), the BERT classifiers are tested at each epoch, but we only report the best results for each sequence length and each disaster. The results are shown in Table 5, and each entry includes a confusion matrix and four metric values (accuracy, precision, recall, and f1-score). For the Iraq-Iran earthquake disaster, the sequence length of 128 presented the best performance (95.96% accuracy and 95.12% f1-score). Likewise, for the Sri Lanka floods, 93.94% accuracy and 85.13% f1-score are the best values, obtained with a sequence length of 128. The best results are obtained on the Iraq-Iran earthquake disaster and the lowest results are obtained on the Hurricane Harvey disaster. Moreover, the sequence length of 128 presented the best results for the first two disasters, and for the remaining five disasters, the sequence length of 64 produced the best results.
In the fourth step, we compared the effectiveness of the fine-tuned BERT with state-of-the-art benchmarks (Alam, Ofli & Imran, 2020; Alam et al., 2021; Kumar, Singh & Saumya, 2019; Madichetty & Sridevi, 2021; Rudra et al., 2018) for damage vs non-damage tweet classification. Five benchmarks are compared with the fine-tuned BERT model for each disaster and the results are given in Table 6. The fine-tuned BERT outperformed the five benchmarks in all seven disasters. For the Iraq-Iran earthquake, the highest f1-score achieved by a benchmark is 78.48%, whereas the proposed framework achieved a 95.12% f1-score, an improvement of 16.64%. Moreover, fine-tuned BERT demonstrated significant improvement in accuracy, precision, recall, and f1-score in comparison to the five benchmarks for the seven disasters. Specifically, the following improvements are observed in the f1-score: 5.37% for Hurricane Irma, 0.69% for Hurricane Maria, 2.61% for Hurricane Harvey, 6.79% for the California wildfires, and 9.54% for the Mexico earthquake. This demonstrates the effectiveness of fine-tuned BERT for binary classification of damage assessment tweets and shows better performance than the five benchmarks on the seven disasters. Among the baselines, Madichetty & Sridevi (2021) presented better performance than the other four baselines.
In the fifth step, a statistical test is conducted to determine whether the improvements are statistically significant or not. For this, the performance of the fine-tuned BERT model is compared with the best-performing baseline using Wilcoxon's signed-rank test (Woolson, 2007). This test is non-parametric, and the null hypothesis can be rejected at the α level. The null hypothesis is that both models have the same performance. The results of the Wilcoxon signed-rank test are given in Table 7. We compared both models using the macro f1-score for each disaster. The fine-tuned BERT outperforms the baseline for six disasters. The null hypothesis can be rejected at the α = 0.05 significance level. First, the difference in f1-scores between the two models is calculated and ranks are assigned based on the absolute difference values. Then, the ranks of the positive differences are summed into R+ and the ranks of the negative differences are summed into R−. We got R+ = 7 + 6 + 5 + 3 + 2 + 4 = 27 and R− = 1, where Vα = 6. As the minimum sum (i.e., 1) is less than 6, we reject the null hypothesis that both models perform equally. Thus, the improvement of fine-tuned BERT is statistically significant for binary classification.
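The rank-sum computation described above can be sketched as follows. The function name is hypothetical, and the scores in the usage note are invented for illustration; they are not the values of Table 7. Zero differences are dropped and tied absolute differences receive their average rank, which is the usual convention for this test.

```python
def signed_rank_sums(scores_a, scores_b):
    """Return (R_plus, R_minus) for paired scores of two models.

    Differences are ranked by absolute value; R_plus sums the ranks of
    positive differences and R_minus the ranks of negative ones.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1     # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    r_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    r_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return r_plus, r_minus
```

With the invented scores `[90, 85, 80, 70, 75, 88, 60]` and `[80, 78, 75, 68, 72, 84, 61]`, six of seven differences favour the first model and the function returns `(27.0, 1.0)`; the smaller sum is then compared with the critical value for the chosen α.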

Fine-tuning of BERT and comparison with baselines (infrastructure vs human vs non-damage)
In this section, the BERT model is fine-tuned to perform multi-class classification of tweets into infrastructure damage, human damage, or non-damage categories. After that, the performance of the fine-tuned BERT model is compared with two benchmarks (Madichetty & Sridevi, 2021; Rudra et al., 2018) and five comparable models.
At first, fine-tuning of BERT is performed using the same parameters described in section 'Fine-tuning of BERT and comparison with baselines (damage vs non-damage)', but here the objective is multi-class classification. Sequence lengths of 64 and 128 are used with a batch size of 32 and a learning rate of 2e−5. The outcomes are measured in the  After that, we tested the BERT classifiers on the test part of each disaster and the best results are reported in Tables 9 and 10. In addition, the performance of the fine-tuned BERT model is compared with two benchmarks (Madichetty & Sridevi, 2021; Rudra et al., 2018) and five comparable models. There are nine comparable models, but we report the results of the top five. The models are trained on 80% of the data and then tested on the remaining 20% of each disaster dataset. Table 9 shows the accuracy and per-class (infrastructure vs human vs non-damage) f1-score values with the macro-average. For the Iraq-Iran earthquake disaster, the fine-tuned BERT outperformed the other models and obtained accuracy (93.88%), precision (85.40%), recall (80.66%), and macro-f1-score (82.63%). An improvement of 6.13% in macro-f1 is obtained by the proposed framework as compared to the benchmark (Madichetty & Sridevi, 2021). Moreover, we can notice the improvement in human damage and non-damage classification presented by the proposed framework as compared to the benchmarks and five comparable models. An improvement of 16.20% in human damage and 19.37% in non-damage category classification is observed. Likewise, for the macro-precision and macro-recall metrics, it is evident from Table 10 that 4.5% and 5.79% improvements are achieved by the proposed framework compared to the benchmark (Madichetty & Sridevi, 2021).
For the Sri Lanka flood disaster, the proposed framework outperformed the two benchmarks and the five comparable models, with improvements of 10.33% in macro f1-score, 12.28% in accuracy, 0.53% in macro-precision, and 17.89% in macro-recall. Moreover, the f1-score improved by 1.25% for infrastructure damage, 17.2% for human damage, and 12.53% for non-damage. For the Mexico earthquake disaster, the proposed framework outperformed all models: improvements of 6.64% in macro f1-score, 9.2% in macro-precision, and 4.87% in macro-recall are observed compared to the benchmarks. For the California wildfire disaster, the fine-tuned BERT model delivered the highest performance, with improvements in macro f1-score (12.95%), macro-precision (1.25%), and macro-recall (14.74%). Furthermore, the fine-tuned BERT model improved human damage classification by 13.16% and non-damage classification by 26.98%.
For the Hurricane Harvey disaster, the proposed framework performed considerably better than the state-of-the-art benchmarks and the five comparable models. It obtained improvements of 8.54%, 19.17%, and 6.94% in the human damage, non-damage, and macro f1-scores, respectively. Furthermore, improvements of 2.33% in macro-precision and 10.89% in macro-recall are observed. For the Hurricane Maria and Hurricane Irma disasters, the proposed framework performed better than the benchmarks and comparable models in accuracy, as shown in Table 9. Regarding the f1-score, non-damage classification improved by 17.7% for Hurricane Maria and 21.07% for Hurricane Irma, but the framework did not achieve the best macro f1-score. This suggests that the fine-tuned BERT did not learn these two disaster datasets as well as the others. This completes the evaluation of the proposed framework on seven disasters for binary and multi-class classification.
Finally, a Wilcoxon signed-rank test is performed to check whether the improvements of the proposed model are statistically significant for multi-class classification. Because the Wilcoxon signed-rank test applies to two classifiers, the fine-tuned BERT model is compared with the second-best performing model, and the results are reported in Table 11. For the improvements to be considered statistically significant, the analysis must reject the null hypothesis. The performance is compared in macro f1-score for each disaster, and the fine-tuned BERT outperformed the second-best model for five disasters. After calculating the differences in performance, ranks are assigned using the absolute values of the differences. The sums of the ranks are then calculated, giving R+ = 25 and R− = 3, with a critical value Vα = 5. Since R− is less than Vα, we reject the null hypothesis. Hence, the improvements of the fine-tuned BERT model are statistically significant for multi-class classification.
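The rank-sum computation behind the Wilcoxon signed-rank test can be sketched as follows. This is an illustrative sketch only: the seven per-disaster differences are hypothetical placeholders chosen so that the rank sums match the reported R+ = 25 and R− = 3 (the actual values are those in Table 11), the function name is an assumption, and the critical value of 5 is taken from the text as stated.

```python
def signed_rank_sums(differences):
    """Return (R_plus, R_minus) for paired differences.

    Zero differences are dropped and tied absolute values share
    the average of the ranks they span.
    """
    nonzero = [d for d in differences if d != 0]
    ordered = sorted(nonzero, key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over a run of equal absolute differences (ties).
        while j + 1 < len(ordered) and abs(ordered[j + 1]) == abs(ordered[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    r_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return r_plus, r_minus


# HYPOTHETICAL macro f1 differences (fine-tuned BERT minus second-best model),
# one per disaster: five positive and two negative, as described in the text.
HYPOTHETICAL_DIFFS = [6.1, 10.3, 6.6, 12.9, 6.9, -0.8, -1.5]
CRITICAL_VALUE = 5  # V_alpha as stated in the text

r_plus, r_minus = signed_rank_sums(HYPOTHETICAL_DIFFS)
# Decision rule as described in the text: reject the null hypothesis
# when the smaller rank sum is below the critical value.
reject_null = min(r_plus, r_minus) < CRITICAL_VALUE
print(r_plus, r_minus, reject_null)
```

With seven disasters the ranks 1 to 7 sum to 28, so R+ and R− must add up to 28; the reported values of 25 and 3 are consistent with the two disasters on which BERT was not the best model having the two smallest absolute differences.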

DISCUSSION AND LIMITATIONS
Assessment of damages and proper coordination of rescue efforts are in high demand for timely response to disasters. Recently, the emergence of state-of-the-art deep learning (DL) technologies has attracted researchers' attention, and robust damage identification models can be developed using these DL techniques by taking advantage of the large datasets now available. However, the strengths and limitations of these technologies are not yet well understood, especially in comparison with traditional ML techniques and with respect to deployment issues, and require further investigation. In this research, we propose a tool for damage assessment from textual data, based upon a benchmark BERT transformer model with fine-tuning. This tool supports two levels of damage identification from tweets: binary and multi-class classification. One effective application of this tool in the rehabilitation and rescue stages would be to combine quantitative statistics of damages with qualitative approaches, to verify the extent of the damages and how they affect individuals and societies. This study made significant contributions in the domain of damage tweet identification and crisis management in real-time disasters. To the best of our knowledge, the fine-tuning of the BERT transformer model is studied here for the first time for both binary and multi-class classification of damage assessment tweets. The most important contribution is the identification of human and infrastructure damages at the second level. The use of contextual semantic embeddings enables us to handle the ambiguity and complexity of the language used to describe damages. The optimization of hyper-parameters for the fine-tuning process helped us to handle the issues of overfitting and catastrophic forgetting. For binary classification, the model outperformed the five benchmarks for all disaster datasets and demonstrated significant improvement in detecting damage assessment tweets. For multi-class classification, it outperformed the two benchmarks and all comparable models for five disaster datasets and delivered comparable performance for the remaining two disasters. This demonstrates the effectiveness of fine-tuned BERT for damage assessment tweet identification in both binary and multi-class classification.
The advantages of the proposed framework can be summarized as follows. First, the baselines are developed using hand-crafted features, whereas the proposed framework uses an automated feature generation model to address the identification of damage assessment tweets. Second, the language model is able to capture the actual context of the language used in the tweets to describe infrastructure and human damages, rather than relying on semantic, syntactic, and frequency-based approaches. Third, the performance improvement achieved by the proposed framework on seven disaster datasets compared to the benchmarks is promising.
This study has some limitations. The dataset (CrisisMMD) is limited to 18,084 tweets and covers only seven real-time disasters. The size and comprehensiveness of the dataset are therefore a potential avenue for improvement. Because our methodology is based on a deep learning paradigm, a larger and more comprehensive dataset would likely improve identification performance. Another limitation is that, although the proposed model addresses multi-class categorization of tweets, it is not yet able to assess the quantity of damages in the human and infrastructure categories. Such assessments would be very helpful for rehabilitation organizations to estimate losses and damages before arriving at disaster locations. Future work can extend the findings of this study to propose a solution for assessing the quantity of damages.

CONCLUSION
This research investigated the issue of damage assessment tweet identification as a binary and multi-class classification problem. We proposed a "contextual semantic embeddings" based model for improving damage assessment tweet identification. To handle the issues of complexity and ambiguity and to support the generalization of the framework, a state-of-the-art language model (BERT) is used with fine-tuning of important hyper-parameters, without relying on basic and hand-crafted features. Moreover, nine comparable models are designed to compare the performance of the proposed framework. Several BERT classifiers are trained, validated, and tested by fine-tuning important hyper-parameters. To evaluate the effectiveness of the proposed framework, the CrisisMMD dataset containing seven disasters is considered. The results demonstrated that the fine-tuned BERT model outperformed all benchmarks and comparable models for binary classification as well as for multi-class classification. Specifically, the proposed framework demonstrated state-of-the-art results by obtaining a macro f1-score of 95.12% for binary classification and a macro f1-score of 88% for multi-class classification. The findings of our study would help disaster management and humanitarian organizations to better manage their rescue activities on time.
In the future, we are interested in exploring fine-tuning language models with adapter mechanisms for similar NLP tasks to reduce the training parameter complexities. The accuracy of results for damage assessment tweet identification could be improved by utilizing some state-of-the-art hybrid methodologies. Furthermore, the explainability of damage assessment tweet identification will be very helpful for rescue and disaster management organizations to manage their services on time.

Figure 4 Training loss and validation loss (damage vs non-damage) for a sequence length of 64 and batch size of 32 (left) and for a sequence length of 128 and batch size of 32 (right). Full-size DOI: 10.7717/peerjcs.1859/fig-4

Figure 5 Training loss and validation loss (infrastructure vs human vs non-damage) for a sequence length of 64 and batch size of 32 (left) and for a sequence length of 128 and batch size of 32 (right). Full-size DOI: 10.7717/peerjcs.1859/fig-5

Table 3 Detail of CrisisMMD dataset.
The results of the experiments with sequence lengths of 64 and 128 and a batch size of 32 are presented in Table 4. Although we also tried batch sizes of 8 and 16 for these experiments, we obtained the best results with a batch size of 32, so only those results are reported. The experiments are conducted using a learning rate of 2e−5, an epsilon of 1e−8, and four epochs.
For each sequence length and epoch, the training loss, validation loss, validation accuracy, validation f1-score, and training and validation times are reported.
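The hyper-parameter settings described above can be written down as a small configuration grid. This is a sketch under stated assumptions rather than the authors' training script: the class and field names are hypothetical, the epsilon of 1e−8 is assumed to be the optimizer's numerical-stability term, and batch sizes of 8 and 16 are assumed to have been tried with both sequence lengths.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class FineTuneConfig:
    """One fine-tuning run; field names are illustrative, not from the paper."""
    max_seq_length: int
    batch_size: int
    learning_rate: float = 2e-5  # learning rate stated in the text
    epsilon: float = 1e-8        # assumed to be the optimizer epsilon
    epochs: int = 4              # number of epochs stated in the text


SEQ_LENGTHS = (64, 128)
BATCH_SIZES = (8, 16, 32)

# Every combination of sequence length and batch size that was tried.
GRID = [FineTuneConfig(seq, batch) for seq, batch in product(SEQ_LENGTHS, BATCH_SIZES)]

# Only the runs with a batch size of 32 are reported in the results tables.
REPORTED = [cfg for cfg in GRID if cfg.batch_size == 32]

for cfg in REPORTED:
    print(cfg)
```

Keeping the shared values as defaults makes it explicit that only the sequence length and batch size vary between runs.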

Table 6 Comparison of fine-tuned BERT classifier with benchmarks (damage vs non-damage).
Notes. The bold values are the highest performances achieved by the proposed model for each disaster.

Table 9 (continued)
Notes. The bold values are the highest performances achieved by the proposed model for each disaster.

Table 10 (continued)
Notes. The bold values are the highest performances achieved by the proposed model for each disaster.